A deep catalogue of protein-coding variation in 983,578 individuals

Rare coding variants that substantially affect function provide insights into the biology of a gene1–3. However, ascertaining the frequency of such variants requires large sample sizes4–8. Here we present a catalogue of human protein-coding variation, derived from exome sequencing of 983,578 individuals across diverse populations. In total, 23% of the Regeneron Genetics Center Million Exome (RGC-ME) data come from individuals of African, East Asian, Indigenous American, Middle Eastern and South Asian ancestry. The catalogue includes more than 10.4 million missense and 1.1 million predicted loss-of-function (pLOF) variants. We identify individuals with rare biallelic pLOF variants in 4,848 genes, 1,751 of which have not been previously reported. From precise quantitative estimates of selection against heterozygous loss of function (LOF), we identify 3,988 LOF-intolerant genes, including 86 that were previously assessed as tolerant and 1,153 that lack established disease annotation. We also define regions of missense depletion at high resolution. Notably, 1,482 genes have regions that are depleted of missense variants despite being tolerant of pLOF variants. Finally, we estimate that 3% of individuals have a clinically actionable genetic variant, and that 11,773 variants reported in ClinVar with unknown significance are likely to be deleterious cryptic splice sites. To facilitate variant interpretation and genetics-informed precision medicine, we make this resource of coding variation from the RGC-ME dataset publicly accessible through a variant allele frequency browser.


Article
a single harmonized sequencing and informatics protocol.Previously published datasets, such as the UK Biobank and the Mexico City Prospective Study, were reprocessed 9,25 .The RGC-ME dataset comprises both outbred and founder populations spanning African (AFR), European (EUR), East Asian (EAS), Indigenous American (IAM), Middle Eastern (MEA) and South Asian (SAS) continental ancestries, and includes cohorts with relatively high rates of consanguinity.More than 190,000 of the unrelated participants (23%) are of non-EUR ancestry in the RGC-ME dataset, as compared with 35,000 in gnomAD genomes (v.3.1.2),53,000 in gnomAD exomes (v.2.1.1),and 91,000 in TOPMed Freeze 8, indicating that RGC-ME represents a large increase in the number of individuals of non-EUR ancestry in datasets of genetic variation 4,8 (Fig. 1a and Supplementary Table 1a).
We performed a comprehensive survey of genetic variation, encompassing single-nucleotide variants (SNV) and insertion-deletion (indel) variants.To estimate population allele frequencies, we focused on 821,979 unrelated samples (referred to hereafter as the 822K unrelated set; Supplementary Table 1a).We identified 16,425,629 unique mutated genomic positions (that is, sites) in autosomal and X-chromosomal coding regions, with one unique reference-alternate allele change (that is, variant) every two bases on average.In canonical transcripts within sequencing target regions, mutations at 35.6%, 32.2% and 9.5% of all possible genomic positions that can lead to synonymous, missense and stop-gained variants, respectively, were observed.In highly methylated CpG sites, we observed 95.0% of all possible synonymous, 92.2% of missense and 78.6% of stop-gained variants.Across all mutational contexts, 21.4% and 8.4% of all possible synonymous variants and stop-gained variants, respectively, were observed (Extended Data Fig. 1).Thus, RGC-ME represents a major advance towards the comprehensive discovery of rare variants.
Among coding variation in canonical transcripts, 1,115,116 pLOF variants were identified, which include those causing a premature stop, affecting essential splice donor and acceptor sites or causing frameshifts (Fig. 1c).Of these pLOF variants, 53.3% were observed as singletons; that is, only observed in one individual.In addition, 4,645,092 synonymous (35.7% as singletons) and 10,444,562 missense (40.0% as singletons) variants in canonical transcripts were detected.A total of 48% of coding variants in canonical transcripts were unique to RGC-ME and absent in other large-scale datasets 4,8 (Fig. 1b).Each sample had a median of 137 pLOF, 8,652 missense and 10,184 synonymous variants (Fig. 1c).AFR individuals had, on average, 18.6% more variants across all functional categories compared with individuals of other ancestries (Extended Data Fig. 2), as expected on the basis of the 'Out of Africa' model of human population history 26 .

Constrained genes
Population-scale sequencing allows the quantification of pLOF variation in genes, which is key to understanding the relationship between genes and diseases.Several gene constraint metrics have been developed to estimate the pLOF tolerance of genes 27 .Here, we estimated pLOF depletion using the cumulative frequency of pLOF variants in a gene to derive a selection coefficient, s het , that quantifies fitness loss due to heterozygous pLOF variation 28 .We estimated the indispensability of 16,710 protein-coding genes on the basis of the observed number of rare pLOF variants per gene with a cumulative alternate allele frequency (AAF) of less than 0.1% compared with the expected number based on gene-specific mutation rates (Supplementary Table 2).The mean s het value in the RGC-ME dataset for canonical transcripts was 0.073 (95% highest posterior density (HPD)): [0.043, 0.12] (median s het = 0.021) (Fig. 2a), which suggests that, on average, a pLOF would result in 7.3% lower evolutionary fitness relative to the reference allele.This estimate is comparable with the mean s het value of 0.073 [0.029, 0.18] (median = 0.028) computed using the same method on the ExAC dataset 29 (n ≈ 60,000).Our sample size (n ≈ 822,000) helped to accurately quantify rare pLOF variants and compute more precise constraint scores than the ExAC values.This finding is best illustrated in known haploinsufficient genes, which are expected to be more constrained and thus have larger s het values relative to all genes (Extended Data Fig. 3a).Compared with values that were computed with ExAC data, s het values for haploinsufficient genes in the RGC-ME dataset were significantly higher ( s Δ =0.045 het , P = 0.002) and had smaller 95% HPD ranges despite those larger means (∆Var(s het ) = −0.026,P = 4.5 × 10 −21 ).Estimates for all genes were more precise in 822K samples compared with a randomly downsampled set of 60,000 samples from the RGC-ME dataset (Extended Data Fig. 3b), in which mean and median 95% HPD ranges were 6.2-and 4.0-fold larger, respectively.
The s het value is higher in genes that are associated with Mendelian diseases 28,30 (Extended Data Fig. 3a), and can differentiate groups of genes under varying degrees of selection (Fig. 2b).We used s het to identify constrained genes by comparing the s het scores of known high-constraint genes (haploinsufficient, autosomal dominant and developmental-specific autosomal dominant) with those of low-constraint genes (haplosufficient and genes with rare biallelic pLOF variants from the RGC-ME dataset) (Extended Data Fig. 4a).Among 1,476 genes in the 'high-constraint' and 3,893 genes in the 'low-constraint' groups, 89.1% of genes with a s het score greater than the mean (0.073) and 66.6% of genes with a s het score greater than the median (0.021) belonged to the high group (Supplementary Table 2).These thresholds served as cut-offs for mean and lower bound (2.5% HPD), respectively, to identify highly constrained genes with fitness deficits on a par with dominant disease-causing genes that also reflect uncertainty in the mean.
We compared s het to other published LOF constraint measures, such as LOEUF 4 , and an alternate method for estimating s het based on approximate Bayesian computing 31 , which we refer to as s het-ABC (Supplementary Figs.2-4).Spearman rank correlations between s het from RGC-ME and these estimates were high (−0.768with LOEUF; 0.778 with s het-ABC ).However, s het derived from RGC-ME had higher sensitivity and specificity in differentiating between constrained and unconstrained genes, compared with LOEUF and s het-ABC (Extended Data Fig. 4a).
Improving s het estimates is most valuable for genes with few expected pLOF mutations 4 , particularly shorter genes.The RGC-ME dataset allowed s het to be estimated more precisely for the smallest quantiles of gene coding sequence (CDS) length (Extended Data Fig. 3c), and derived more informative constraint metrics using an allele-frequency-based approach and a larger sample.We derived constraint scores for 923 genes that had 5 or fewer expected pLOF variants, deemed underpowered for similar analyses with LOEUF 32 .These 923 underpowered genes were significantly shorter, with a mean CDS length of 573 base pairs compared to 1,797 base pairs for genes with more than 5 expected pLOF variants.Eighty-six genes were highly constrained, with a mean s het value greater than 0.073 and a lower bound greater than 0.021 (Extended Data Fig. 4b), and are promising candidates for efforts to discover new disease-associated genes.Thirty per cent (26 of 86) have been linked to human diseases or shown to be essential in mice or cell lines (Supplementary Table 2).These include well-studied genes with known importance in cellular function, such as the transcription factor TWIST1 (ref.33), DNA-and RNA-binding protein BANF1 (refs.34,35) and transactivator CITED2 (ref.36).
Overall, 3,988 highly constrained genes had s het values greater than 0.073 and a lower bound greater than 0.021.Although 1,153 of these lack known associations with human diseases or lethal mouse knockout phenotypes, they are likely to have high functional importance.These constrained genes might lack disease associations because the loss of even a single copy is incompatible with life or causes reduced reproductive success without clinical disease 37 .

Constrained coding regions
Identifying sub-genic regions that are intolerant of mutations can reveal functionally important regions that would otherwise be missed when constraint scores are aggregated at the gene level.Models of local coding constraint are powerful tools for identifying protein domains with crucial functions and for variant prioritization [38][39][40] .In addition to gene constraint derived from pLOF variation, we also identified regions depleted of missense variation using the missense tolerance   40,41 , defined as the ratio of the observed to the expected proportion of missense variants adjusted by synonymous variation in a defined codon window.Using the 822K unrelated samples, we calculated the MTR for each amino acid along the CDS within sliding windows of 21 and 31 amino acids (MTR scores available on figshare; see 'Data availability') and characterized continuous segments of missense constrained regions (Supplementary Table 3).
Compared with benign missense variants, ClinVar pathogenic missense (two stars or more) variants were highly enriched in the top percentile of exome-wide MTR scores (odds ratio = 100.0and 89.8, computed with 21-and 31-codon windows, respectively; Fig. 3a).Our sample size, which is nearly four times larger than that used in previous MTR estimates 41 , resulted in improved discrimination between pathogenic and benign variants for top-10-percentile MTR scores in which we observed significant enrichment (Fig. 3a).This larger sample size enabled us to identify 24% more missense variants in the top-1-percentile constrained MTR scores (512,499 versus 413,147) compared with a subsampled set of 225,000, after adjusting for a false discovery rate (FDR) lower than 0.1.In addition, the increased power derived from 822K samples resulted in higher resolution for distinguishing pathogenic from benign variants for MTR computed with 21-codon windows, albeit at the expense of having fewer scored missense variants overall (295,958 constrained missense variants).
Deleterious variants are expected to have lower allele frequencies than neutral variants, owing to negative selection.We can infer the functional importance of different classes of variation by comparing the proportion of singletons in each class.We computed the deleteriousness of variants using an updated mutability-adjusted proportion of singletons (MAPS) metric 5,32 and derived an MTR score threshold at which their MAPS score corresponds to that of missense variants that were predicted to be deleterious by five out of five prediction algorithms in dbNSFP (v.3.2;see Supplementary Information); that is, 5/5 missense variants.Variants with MTR values in the top-15-percentile exome-wide threshold (MTR < 0.841) were predicted to be as deleterious as 5/5 missense variants (Extended Data Fig. 5a).For 31-codon windows, 1.24% (129,990) of all missense variants (excluding known ClinVar pathogenic variants) observed in the RGC-ME dataset had significant MTR scores in the top 15 percentile.These missense variants in the top 15 percentile of exome-wide MTR are potentially deleterious and could be suitable for prioritization in projects aiming to discover disease-associated genes.MTR is a useful metric of regional constraint that may capture functionally important segments within genes.We defined MTR-constrained regions as continuous regions within a protein that have variants with MTR values in the top-15-percentile threshold (Supplementary Information and Extended Data Fig. 6a).We identified 41,114 missense constrained regions in 12,349 genes (Supplementary Table 3).Our findings overlap with results from a previous study 38 that estimated the regional observed-to-expected missense ratio (ɣ) from around 60,000 ExAC samples (Extended Data Fig. 6b) to derive a composite missense deleterious score called MPC.We refer to the MPC-derived constrained regions as MPC segments, and compared these with MTR-constrained regions.MTR-constrained regions had a median length of 22 residues [14-35, quartile 1-quartile 3], compared with 358 [208, 579] in MPC segments.Overall, we identified 8.59 times more MTR-constrained regions than MPC segments (ɣ ≤ 0.612, top 15 percentile) across 2,832 transcripts with data from both methods (Extended Data Fig. 6c).
We examined the distribution of de novo missense variants in MTR-constrained regions and observed a significant enrichment (P = 2.61 × 10 −10 ) of variants identified in individuals with neurodevelopmental disorders (Extended Data Fig. 7a,b).Case variants were 1.85 times [1.50, 2.31 (95% CI)] more likely to occur in constrained regions, compared with controls.As expected, well-supported (two stars or more) ClinVar pathogenic missense variants were also highly enriched (P ≈ 0) in MTR-constrained regions.Pathogenic variants were 8.82 times [8.17, 9.53 (95% CI)] more likely to occur in missense constrained regions than were benign variants.
Missense constrained sites were found in key functional regions, such as DNA-binding regions and active sites (Fig. 3b).Among membrane proteins, transmembrane regions ranked higher in MTR-constrained regions than did cytoplasmic and extracellular domains.We also compared the overlap of MTR-constrained and functional regions by computing Jaccard indices.Ubiquitin-conjugating (UBC) core domains and DNA-binding regions had the highest overlap with constrained regions ( Jaccard index = 0.52 and 0.18, respectively), suggesting that, among UBC enzymes, more than half of the union set between MTR-constrained regions and core domains overlapped.Other enriched functional regions included protein kinases and nuclear receptor ligand-binding domains (Supplementary Table 4).
A total of 4,064 genes contained regions depleted in missense variation with a significant proportion of their coding sequence in the top 15 percentile of MTR (binomial test with π 0 = 0.15, P < 0.05 after multiple testing correction; Supplementary Table 5).To identify genes with signatures of missense-only constraint, we assessed the LOF-constraint metric, s het , of these highly missense constrained genes (Fig. 3c).Among the 4,064 genes, 1,482 either were not LOF constrained or lacked s het estimates.These genes had significantly shorter CDS lengths than those of the 1,424 LOF-specific constrained genes (P = 2.9 × 10 −40 , Wilcoxon test; Extended Data Fig. 7c).Estimating region-level LOF constraint is difficult owing to strong selection against pLOF variants, which leads to a paucity of pLOF variation.MTR serves as a complementary lens for identifying, first, functionally important regions at a higher resolution than gene-level LOF constraint, and, second, regions within genes that are depleted of missense variation but tolerant of LOF variation.For example, KRAS, a well-known oncogene, is LOF tolerant (s het = 0.002, LOEUF = 1.24); however, the first 80 amino acids (42%) of the protein sequence were ranked in the top 1 percentile of exome-wide MTR (Fig. 3d).This region includes the P-loop, switch 1 and switch 2 functional domains, which form crucial binding interfaces for effector proteins 42 , and these results therefore highlight the importance of regional constraint metrics.

Understanding 'human knockouts'
Identifying genes with biallelic pLOF variants provides an opportunity to understand gene function directly through the phenotypic characterization of individuals who have such variants-effectively, naturally occurring 'human knockouts'.The RGC-ME dataset includes founder populations and cohorts with high rates of consanguinity, contributing to a comprehensive collection of homozygous loss-of-function variation 25,[43][44][45] .Overall, we identified 4,686 genes comprising 8,576 rare (AAF < 1%) homozygous pLOF variants in 64,852 individuals (Supplementary Table 6).Furthermore, we identified 1,205 genes with carriers of rare (AAF < 1%) heterozygous pLOF variants in trans; that is, compound heterozygotes, 162 of which lacked homozygous pLOFs.In total, 4,848 genes were discovered with carriers of biallelic pLOF variants in which both alleles of a gene were affected by pLOF variation and could be described as putative gene knockouts (pKOs).Of these, 1,751 (1,650 from homozygous pLOFs only) have not to our knowledge been previously reported.Biallelic pLOF variants in RGC-ME are rare; 64.3% of homozygous pLOF variants and 37.4% of pKOs were detected in one participant (Fig. 4a).As expected, cohorts with higher rates of consanguinity were enriched in homozygous pLOF variants, compared with outbred populations, despite smaller sample sizes (Fig. 4b,c and Extended Data Table 1).
pKOs were significantly less constrained, with a lower s het (on average −0.074 [−0.077, −0.071 (95% CI)], t-test) relative to all other genes.Only 2.67% of pKOs had an s het value greater than 0.073, as compared with 21.6% of all human genes, and 47.2% of pKOs were in the lowest quintile of s het scores exome-wide (s het < 7.07 × 10 −3 ).A caveat is that s het , like most gene-specific measures of constraint, is designed to capture the effect of heterozygous LOF 46 .Although genes containing biallelic pLOF variants are under less heterozygous selective pressure, existing sample sizes are inadequate 47 to directly compute selection on homozygous variation.pKOs are overrepresented in drug and xenobiotic metabolism pathways (Supplementary Fig. 6).
Among very rare doubleton variants for which we observed exactly two copies of the alternate allele, we observed a clear excess of Carriers of homozygous pLOFs and compound heterozygous variants were included in this analysis.b, Breakdown of the number of unique pKO genes observed in the RGC-ME dataset by ancestry.Both sets of rare biallelic variantshomozygous pLOFs and compound heterozygous-were included in this analysis.See Extended Data Table 1a for a breakdown by ancestry of each type.c, Projected accrual of pKO genes using homozygous pLOF variant data at hypothetical cohort sizes for each ancestry in 983,578 related individuals.Curves reflect the accrual of the expected number of genes with at least one, at least five and at least ten carriers, respectively, of a rare, homozygous pLOF.Asterisk denotes the inclusion of cohorts with a high rate of consanguinity.

Article
homozygotes that is likely to be explained by population structure and background inbreeding.For example, among missense and synonymous variants, we observed 5,857 and 2,490 homozygotes among 1,580,917 and 679,335 doubleton variants, respectively, compared with a Hardy-Weinberg equilibrium (HWE) expectation of fewer than one homozygote in each case.These estimates corresponded to a background inbreeding coefficient of 0.37%.Among pLOF variants, we observed only 406 homozygotes among 129,405 doubleton variants (Supplementary Table 7).Although this number is much larger than HWE expectations, it is around 15% less than the expected 479 homozygotes calculated using an inbreeding coefficient of 0.37% (P = 0.0095, Fisher's exact test).This suggests that a notable proportion of these homozygotes were never observed in our sample population.
Genes with biallelic inactivating mutations could reveal potential drug targets that can be disrupted with minimal side effects 43 .Drug targets with homozygous pLOF variants in humans are more likely to progress from phase I trials to approval 44 .Of 997 inhibitory preclinical targets listed in the Drug Repurposing Hub, 182 (18.3%) had at least one individual with a rare biallelic pLOF variant in the RGC-ME dataset 48 .In-depth phenotyping of human knockouts can help researchers to better understand the efficacy and side-effect profiles of these potential drug targets.Human knockouts provide a way to understand the consequences of lifelong deficiency of a gene 49 .

Annotation of splice-affecting variants
Several prediction tools [50][51][52][53] have been developed to understand the effects of genetic variants on alternative splicing.Although these tools mainly assess whether a variant affects splicing, some also provide a pathogenicity metric or score threshold as a measure of deleteriousness.Predicted cryptic splice sites with SpliceAI scores greater than 0.8 have been validated at high rates using RNA sequencing and are as depleted at common allele frequencies as pLOF variants 50 .Here, we used human genetic data to optimize splice prediction score thresholds enriched for deleterious variants that affect splicing.We systematically quantified the deleteriousness of variants at various splice prediction score thresholds using the MAPS metric.As previously demonstrated 2,4 , pLOF variants had the highest MAPS scores, followed by missense, synonymous and noncoding variants, respectively (Fig. 5a).
We used splice predictions from SpliceAI 50 and MMSplice 51 to group variants into predicted splice score bins, and identified the minimum threshold at which the MAPS score of the variants is equal to that of 5/5 missense variants (variants predicted to be deleterious by five out of five prediction methods).The proposed prediction score thresholds of 0.35 for SpliceAI and 0.97 for MMSplice pathogenicity (Fig. 5a and Extended Data Fig. 5b) identify predicted deleterious splice-affecting variants (SAVs).
A total of 296,696 predicted deleterious coding SAVs (inclusive of canonical splice sites, splice region and untranslated region (UTR) splice sites) in the RGC-ME dataset had scores that exceeded the MAPS-derived splicing thresholds for both SpliceAI and MMSplice (referred to as the intersection set; Extended Data Fig. 8a).Of these, 43.5% (129,118) were cryptic splice sites (that is, non-canonical splice sites).Unsurprisingly, canonical splice sites and variants within the splice region comprised the largest category of predicted deleterious SAVs.Both SpliceAI and MMSplice identified around 80% of LOFTEE (loss of function transcript effect estimator; ref. deleterious SAVs (Extended Data Fig. 8a,b).In addition, around 68% of LOFTEE low-confidence splice sites were predicted to be deleterious SAVs (94% of low-confidence splice sites were in the UTR).The impact of non-canonical splice variants on alternative splicing is often underestimated; we found that missense variants accounted for 11.3% of all predicted deleterious SAVs identified by both SpliceAI and MMSplice in the RGC-ME dataset (Extended Data Fig. 8a,b).Predicted deleterious SAVs were enriched in well-supported ClinVar pathogenic variants (two stars or more) compared with other metrics of variant deleteriousness (Fig. 5b); for example, compared with combined annotation dependent depletion (CADD) 54,55 score ≥ 20 (odds ratio = 4.5, P = 0).Missense SAVs were significantly enriched for pathogenic variants compared with 5/5 missense variants (odds ratio = 1.8, P = 3.3 × 10 −8 ) and missense variants with CADD ≥ 20 (odds ratio = 3.6, P = 1.01 × 10 −26 ), respectively.Notably, splice sites in the intersection set were also significantly enriched for pathogenic variants compared to LOFTEE high-confidence splice sites, indicating that the MAPS-derived metric identifies deleterious splice sites.Similar results were obtained when we evaluated the enrichment of pathogenic variants compared with benign ones (Supplementary Table 8a).
We next assessed the MAPS-derived splice prediction thresholds for variants that have been experimentally assessed for splicing effects [56][57][58] (Supplementary Table 9).Predicted deleterious SAVs identified in the intersection set were significantly enriched in experimentally validated large-effect splice-disrupting variants (SDVs) compared with non-SDVs in all functional categories except the splice site category, although the odds ratio was greater than one for splice sites (Fig. 5c).Variants of unknown significance (VUSs) in ClinVar that were predicted as deleterious SAVs were also significantly enriched in experimentally validated SDVs (Fig. 5c).Of the 563 predicted deleterious SAVs assayed in the experimental data, 346 (61.5%) were SDVs and more than half were cryptic splice sites, including 13 ClinVar VUSs (Extended Data Fig. 8d).
We also derived stringent thresholds to identify SAVs by removing canonical splice sites and calibrating exclusively coding non-splice-site (nonSS) variants to a MAPS score comparable with 5/5 missense variants.These thresholds corresponded to a SpliceAI score of 0.43 and an MMSplice score of 0.97 (Extended Data Fig. 5c).Pathogenic enrichment was consistent when comparing deleterious coding nonSS and missense SAVs with corresponding variant categories filtered by CADD ≥ 20 (Supplementary Table 8b,c).Consistent results were also obtained when comparing the enrichment of deleterious SAVs in SDVs to non-SDVs after applying thresholds for coding nonSS variants (Extended Data Fig. 8c,e).

Clinical utility of rare variants
To understand the prevalence of disease-associated alleles in the general population, we identified well-supported ClinVar 59 pathogenic variants (two stars or more) across 2,042 genes in 822K unrelated RGC-ME samples.We found that 40.7% of pathogenic variants (20,343/49,990)  were observed in the RGC-ME dataset, of which 99.6% (20,262) had an AAF of less than 0.1% and 17.8% (3,619) were observed once.In comparison, 20% (9,821) and 29% (14,700) of pathogenic variants were observed in ExAC exomes (n ≈ 60,000) and gnomAD v.2.1.1 exomes (n ≈ 126,000), respectively (Extended Data Fig. 9a).This highlights the importance of the RGC-ME dataset's larger sample size in identifying rare pathogenic variants.On average, individuals carry 1.58 pathogenic variants, with the majority of these individuals being heterozygous carriers of these variants.Specifically, 61.4% of the 822K unrelated individuals were heterozygous carriers of pathogenic recessive alleles in 1,143 of 2,659 known autosomal recessive genes (mean, 0.98 pathogenic alleles per person); 0.21% of the samples were homozygotes of pathogenic variants in 167 autosomal recessive genes; and 3.64% were heterozygous carriers of 353 of 1,629 total autosomal dominant genes.Pathogenic variant annotations should be interpreted cautiously owing to the incomplete penetrance of disease alleles 60 .
The American College of Medical Genetics identified a set of genes (ACMG SF v.3.1) with clinically actionable variants that predispose individuals to diseases and for which medical interventions are available to reduce mortality and morbidity 61 .Among the 822K unrelated individuals, 22,846 (2.77%) had at least one ClinVar-reported (two stars or more) pathogenic missense or pLOF variant for 72 out of 76 autosomal genes on the ACMG list (Supplementary Table 10).As expected, two of the most prevalent pathogenic variations were the HFE Cys282Tyr allele (enriched in EUR, n EUR-homozygotes = 3,220 and AAF EUR = 13.8%) and the TTR Val142Ile allele (enriched in AFR, n AFR = 1,670 and AAF AFR = 3.4%).
We also tallied carriers of likely pathogenic pLOF variants (novel variants not yet reported as pathogenic in ClinVar) in 44 genes in which truncation is known to lead to disease.A total of 2,357 (0.3%) individuals in the RGC-ME dataset carried 1,407 likely pathogenic variants across 40 of these genes.In total, 3.06% of the individuals in the RGC-ME dataset were carriers of pathogenic or likely pathogenic variants.Excluding individuals with high-frequency pathogenic variants in the HFE (Cys282Tyr) and TTR (Val142Ile) genes, 2.38% of the individuals in the RGC-ME dataset carried an actionable variant (Supplementary Table 10).This number is comparable with those from other reports 6,7,62 of actionable variants, which range from 2% to 4.1% for gene sets that include ACMG v.2.0 and v.3.0.As expected, pathogenic variants are rare in large-scale studies of the general population.We found that 39% and 79% of pathogenic and likely pathogenic variants, respectively, were singletons.Focusing on non-ACMG genes, we found that 1.27% of individuals were heterozygous carriers of pathogenic variants in autosomal dominant genes, and 0.21% were homozygotes of pathogenic variants in autosomal recessive genes.
Because the RGC-ME dataset includes uniformly processed exome data from a relatively large proportion of individuals from continental ancestries other than EUR, we assessed the range of allele frequencies of variants present in ClinVar across four continental populations: AFR, EUR, IAM and SAS.Approximately 34% of unique pathogenic coding variants in equalized subsamples were observed only in individuals of non-EUR ancestry, which indicates that sampling diverse populations is necessary for the comprehensive identification of rare variation.Across all unrelated individuals, on average, those of EUR ancestry had 63% more pathogenic variants that were well characterized (rated two stars or more) per sample than did individuals of AFR ancestry.Conversely, individuals of EUR ancestry had, per sample, 25.6% fewer VUSs (Extended Data Fig. 9b,c) and 18.6% fewer variants across all functional types (Extended Data Fig. 2).In individuals of AFR ancestry, a consistent pattern of significantly fewer high-confidence (two stars or more) pathogenic variants (−0.576 [−0.567, −0.585 (95% CI)], t-test) to a surplus of VUSs (42.13 [421.97,42.28]), compared with individuals of EUR ancestry, suggests that the most wellcharacterized pathogenic variants were depleted in this population (Extended Data Fig. 9b).Recruiting diverse individuals to enable the identification and characterization of novel pathogenic variants might help to address this ascertainment bias.Further analyses of pathogenic coding variants and differentiated alleles between ancestries are included in Supplementary Fig. 7 and Supplementary Table 11.
Understanding VUSs is currently a bottleneck in the interpretation of variation in clinically relevant genes and a challenge in clinical management 19 .Although VUSs have less empirical evidence for pathogenicity, they comprise the bulk of ClinVar, with more than one million variants.Notably, VUSs in regions of low MTR may be deleterious, comprising 5,079 (0.68%) VUSs in the top 1 percentile of MTR-constrained regions and 17,500 VUSs (2%) in the top 15 percentile (Supplementary Table 12).Using the MAPS-derived splicing score thresholds, we identified more than 11,000 candidate deleterious cryptic splice sites among VUSs (1,366 synonymous variants in 822 genes and 10,407 missense variants in 3,501 genes), offering potential insights into their functional consequences for clinical prioritization and interpretation efforts.

Discussion
The RGC-ME dataset, derived from 983,578 exomes, provides a harmonized catalogue of around 20 million coding variants in individuals from a diverse array of ancestries and is publicly accessible at https:// rgc-research.regeneron.com/me/home.Cataloguing variation at scale provides an opportunity to accurately estimate the frequency of rare variants-allowing us to precisely compute gene and regional constraint metrics, expand the compendium of rare human knockouts, annotate deleterious cryptic splice sites, characterize variant frequencies across different ancestries and assess the population prevalence of pathogenic variation.RGC-ME will be an invaluable resource for interpreting rare variants and is a step towards the realization of precision medicine.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-024-07556-0.  1 for sample size breakdowns.
Extended Data Fig. 3 | s het distribution across gene essentiality and disease categories, and evidence that a larger sample improves the precision of s het estimates.a, Proportion of genes in each disease/essentiality annotation list (gathered from published literature and databases) that correspond to each s het decile.b, Ratio of variances from MCMC sampling for s het calculated on full dataset (824k) and randomly downsampled set of 60,000 individuals.The mean ratio around 0.05 suggests that gene-level variance from the full dataset is 20x smaller than variance for the same gene using the downsampled set.This is the case despite similar, or even higher, s het mean estimates as shown by the colour bar.c, Mean (points) and 95% HPD (green bars) from MCMC sampling for s het estimates of genes in each CDS length quantile for downsampled 60k (left) and full 822K (right) samples.HPD and variance in panels b,c were derived from MCMC sampling for 8,000 final iterations.

Extended Data Fig. 4 | Comparisons between s het computed with RGC-ME
and other gene constraint metrics.a, Receiver operator curves showing the discrimination between the "high" and "low" constraint genes using different constraint metrics: s het-RGCME , s het-ABC , and LOEUF for transcripts with values from all three methods.The highly constrained category comprised 1,476 haploinsufficient and autosomal dominant (including developmental-specific) genes and the comparison group was represented by 3,893 haplosufficient and genes with rare, biallelic pLOFs in RGC-ME.Dotted lines indicate specificity and sensitivity for s het = 0.073.Sensitivity = TP / (TP+FN), Specificity = TN / (TN+FP).Spearman rank correlations between s het from RGC-ME with these estimates are high (-0.768with LOEUF, 0.778 with s het-ABC ).b, s het vs LOEUF results for 973 genes with ≤5 expected LOFs. 5 genes are highlighted for short coding sequence length (CDS), that are constrained according to s het (both mean>0.073and lower bound>0.021)and unconstrained according to LOEUF.LOEUF is an alternate measure of pLOF-based gene constraint and ranges from 0 to around 2; a value < 0.  1a for sample size breakdowns.The x-axis indicates ClinVar star rating which is a measure of the confidence of the annotation (pathogenic/benign).
Fig.2| Estimates of gene-level constraint, representing s het , from the RGC-ME dataset.a, Mean s het probability density for 16,710 canonical transcripts with 95% CIs calculated with 10,000 bootstrapped samples from the means of individual genes.b, Odds ratios (points) and 95% CIs (short horizontal lines; computed using standard error) for genes with s het cut-off > 0.073 (deemed highly constrained genes) to be included in each gene category listed on the

Fig. 3 |
Fig.3| Missense regional constraint captured by MTR.a, Odds ratio (OR) (points) and 95% CIs (error bars, two-sided Fisher's exact test) of ClinVar pathogenic versus benign variants in MTR ranking regions across the whole exome.Comparisons include MTR calculated using the 822K unrelated samples from the RGC-ME dataset on 31-codon (blue) and 21-codon (pink) windows, and MTR calculated using a random subset of 225,000 samples from the larger 822K samples using a 31-amino-acid sliding window (yellow).MTR values include variants for which the FDR is less than 0.1.A total of 21,047 benign ClinVar variants and 12,872 pathogenic ClinVar variants (two stars or more (CV2+)) were included.b, MTR ranking distribution of different protein functional regions and variant groups (in order: DNA-binding sites (n = 2,787), ClinVar pathogenic sites (n = 10,673), active sites (n = 2,787), transmembrane region (n = 2,787), localized to extracellular (n = 25,665), localized to cytoplasm (n = 32,994) and ClinVar benign sites (n = 20,739)).Box plot shows median and 25-75% interquartile range.The whisker minima and maxima represent the smallest and largest data points within 1.5× the interquartile range from the lower quartile and upper quartile, respectively.Every functional category was significantly more constrained than the category to its right with twosided Wilcoxon rank-sum test (least significant p-value = 2 × 10 −4 , Bonferroni correction).c, Distribution of the proportion of genes located in exome-wide top-15-percentile MTR regions against the heterozygous selection coefficient, s het .Genes with a significant proportion in the most constrained 15-percentile MTR region are coloured in pink and yellow (P < 0.05, Bonferroni-corrected, one-sided binomial tests with π 0 = 0.15), stratified by LOF constraint (s het = 0.073).Pink dots highlight genes that are tolerant of LOF, but have some regions depleted of missense variation.Blue lines indicate the density of genes with s het scores from 0 to 1 (right margin) and genes with a proportion of MTR in the top 15 percentile exome wide (above plot).d, MTR track of an oncogene, KRAS, a missense-specific constrained gene, along with the domain structure of the protein.The blue MTR-constrained region is defined by top-15-percentile exome-wide MTR rank.The N-terminal region containing amino acids 1-80 is depleted of missense variation, even though KRAS is tolerant of heterozygous LOF variation (s het = 0.002).

Fig. 4 |
Fig. 4 | Rare biallelic pLOF variants and 'human knockouts' in the RGC-ME dataset.a, Distribution of the number of individuals per pKO on the log 10 scale.Carriers of homozygous pLOFs and compound heterozygous variants were included in this analysis.b, Breakdown of the number of unique pKO genes observed in the RGC-ME dataset by ancestry.Both sets of rare biallelic variantshomozygous pLOFs and compound heterozygous-were included in this analysis.See Extended Data Table1afor a breakdown by ancestry of each type.c, Projected accrual of pKO genes using homozygous pLOF variant data at hypothetical cohort sizes for each ancestry in 983,578 related individuals.Curves reflect the accrual of the expected number of genes with at least one, at least five and at least ten carriers, respectively, of a rare, homozygous pLOF.Asterisk denotes the inclusion of cohorts with a high rate of consanguinity.

Extended Data Fig. 1 |
Mutation saturation survey in RGC-ME data.Counts are based on variant-transcript pairs.*Methylation level: mean methylation values across tissues of >0.65, 0.2-0.65,<0.2 correspond to methylation level of 2, 1, 0, respectively.CpG sites with methylation level of 2 are highly methylated, sites with methylation level 0 or 1 are grouped as lowly methylated sites.Variants are subset to target exome regions.Only variants that passed QC are included in the number of all possible variants.Extended Data Fig. 2 | Box plots summarizing counts of variants observed in each unrelated sample in the RGC-ME dataset.The lower bound, centre and upper bound of each box plot represents the 25, 50, and 75 percentiles of the distributions of counts.Points represent outliers, and whisker minima and maxima represent the smallest and largest points 1.5-times beyond the interquartile range.Individuals were assigned discrete ancestries based on overall fine-scale ancestry probabilities >50% and a total of 755,261 individuals were included.See Supplementary Table 35 is considered constrained.Error bars (grey lines) around s het denote 95% highest posterior density.Extended Data Fig. 5 | Determination of the MAPS score threshold to define constrained regions and SAVs.a-c, To systematically evaluate the deleteriousness of missense variants at various MTR scores, we compared MAPS scores across MTR (a) and splice score (b,c) thresholds.a, MTR was divided from 5% to 100% with step size of 5% (lower MTR percentile is more constrained).For each MTR percentile threshold, we divided variants into two sets: variants that pass the percentile threshold (red dots), and variants greater than the percentile (blue dots).The y axis represents MAPS score for each set of variants; the x-axis shows the tested MTR percentile threshold.Error bars represent standard deviation around the mean proportion of singletons per bin.A total of 68,636,473 variants were included in this analysis.b,c, Schematic of MAPS-derived filters for SpliceAI and MMSplice for all variants (b) and coding nonSS variants (c), respectively.A list of prediction score thresholds were set up for both SpliceAI and MMSplice, ranging from 0.1 to 0.99, with step size of 0.02.For each threshold, we divided variants into two sets: the set of variants that passed the threshold, represented as red dots, and the set of variants that failed the filter, represented as blue dots.The y axis represents MAPS score for each set of variants; the x-axis shows the tested score threshold.All MAPS scores were calculated based on the set of variants that pass QC metrics and have splice prediction scores in RGC-ME unrelated samples.Error bars represent standard deviation around the mean proportion of singletons per bin.For all variants (b), the total number of variants were n = 12,886,467 and 20,604,651 for SpliceAI and MMSplice, respectively.For coding nonSS variants (c), n = 5,468,779 and 4,013,820 for SpliceAI and MMSplice, respectively.Extended Data Fig. 6 | MTR-based segmentation and comparison of constrained regions in 2,832 genes that have constrained regions in both MTR and MPC (matched on transcript IDs).a, Dashed line indicates the top-15-percentile exome-wide MTR threshold used for MTR-based segmentation (0.841).Blue and red regions represent MTR-constrained and unconstrained regions, respectively.b, Constrained MTR regions that overlap with MPC segments.80% of MTR-constrained regions (represented by the combined area of red and yellow) overlap with MPC-constrained segments (yellow), whereas 40% of constrained MPC segments (represented by the combined area of green and yellow) are included in the intersection (yellow).The numbers indicated on the Venn diagram represent number of amino acids.Aside from 2,832 genes that had both MTR-and MPC-constrained regions, an additional 297 genes had only MPC-constrained segments, and 9,514 genes had only MTR-constrained regions (not included in the Venn diagram).c, Distribution of the fold change of the number of MTR-constrained regions and constrained MPC segments (γ ≤ 0.6) per gene.Dashed line indicates the median fold change of 3. Data shown is for the 2,832 genes that have constrained region annotations both in MTR and MPC segments.Extended Data Fig. 7 | Constrained regions are enriched in ClinVar pathogenic variants and shorter than LOF-specific constrained genes.a, Odds ratios (points) comparing enrichment of pathogenic versus benign ClinVar variants (solid line) and de novo variants in cases versus controls (dotted line) in MTR-constrained regions.Error bars represent 95% CIs (Fisher's exact test).The total number of variants in each comparison are shown in b for the MTR-percentile cut-offs highlighted in yellow.In total, 5,818 case and 553 control variants were used in the de novo analysis and 7,944 pathogenic and 11,993 benign variants were used in the ClinVar analysis.b, Table of case and control variants in constrained and unconstrained regions to compute statistical tests for ClinVar ("CV") and de novo ("DN") variants across 5 different MTR-percentile thresholds (13-17%, yellow boxed region in a).Statistics include hypergeometric tests (p-value for enrichment of case and control variants in constrained regions) and odds ratios comparing enrichment of case vs control in constrained regions.The background rate of constrained regions among variants in the comparison set represented by "% constrained background".c, CDS length comparison between 1,482 missense-specific constrained genes (defined where >15% of gene is in the top 15 percentile of MTR, based on one-sided binomial tests with π 0 = 0.15, p < 0.05, Bonferroni corrected) and 1,424 LOF-specific constrained genes with s het score <0.073.Log10 CDS length for all 19,644 genes (canonical transcripts) shown in the grey curve.The missense-specific constrained genes had significantly shorter CDS length than LOF-specific constrained genes (p = 2.9 × 10 −40 , two-sided Wilcoxon test).Extended Data Fig. 8 | Characteristics of predicted deleterious SAVs.a, Summary of unique variant-transcript pairs of predicted deleterious SAVs in different functional categories using MAPS-defined prediction score thresholds for SpliceAI and MMSplice.Percents are computed out of total variants in each effect class.b, Distribution of different functional categories of predicted deleterious SAVs.(HC: High confidence, LC: Low confidence.These annotation tags are derived from LOFTEE).c, Empirical validation of MAPS-predicted deleterious coding nonSS SAVs: enrichment of predicted coding nonSS deleterious SAVs in experimentally validated SDVs compared to non-SDVs.Odds ratios (points) were derived using two-sided Fisher's exact test and error bars show 95% CIs.A total of n = 17,395 variants, of which 147 SAVs were validated SDVs, were included.d,e, Fraction of predicted deleterious SAVs (d) and coding nonSS SAVs (e) that were validated as SDVs by any of the three splice reporter assays.Extended Data Fig. 9 | ClinVar variant counts.a, Counts of autosomal pathogenic ClinVar high-confidence variants (two stars or more) observed in large-scale exome sequencing studies, including RGC-ME, gnomAD (exomes, v2.1.1),and ExAC, are indicated on the top of each bar.The left axis indicates the total coverage of ClinVar pathogenic variants represented in each dataset.b, Bars and points both depict the mean per cent difference of perindividual counts in ClinVar categories (pathogenic 0+, 1+, 2+, VUS+CI) across continental ancestries using all unrelated samples, with respect to EUR (e.g.[count AFR -count EUR ]/count AFR × 100).CV0, 1, and 2 refer to ClinVar pathogenic star rating 0+, 1+, and 2+ categories, respectively.VUS/CI combines variants annotated as "variants of unknown significance" and "conflicting interpretations".All per-individual counts in non-EUR were significantly different compared to counts in EUR (e.g.per-individual counts of ClinVar 2+ variants in AFR were 0.576 [0.567, 0.585] lower than those in EUR) except for ClinVar 0+ counts in IAM compared with EUR (t-tests with Bonferroni correction).Error bars show 95% CI of mean per cent difference from t-test.c, Per-individual count of VUSs and conflicting information (CI, left); and pathogenic variants (right) in RGC-ME for all unrelated samples.The lower bound, centre and upper bound of each box plot represents the 25, 50, and 75 percentiles of the distributions of counts.Points represent outliers, and whisker minima and maxima represent the smallest and largest points 1.5-times beyond the interquartile range.For b,c, a total of 749,584 unrelated individuals were included across 4 ancestries and individuals were assigned discrete ancestries based on overall FSA probabilities >50%; see Supplementary Table

Identification of deleterious variants that are predicted to affect splicing
4) high-confidence splice sites and around 10% of variants within splice regions as predicted .a, MAPS across different functional categories.Error bars show standard deviation around the mean proportion of singletons (points).The yellow dashed line represents the SpliceAI and MMSplice score threshold for variants that have a MAPS score equal to that of 5/5 missense variants (predicted deleterious by five algorithms).Variants with a SpliceAI score ≥ 0.35 or a MMSplice score ≥ 0.97 are predicted deleterious SAVs.Noncoding variants refers to intronic, downstream (variant located 5′ of a gene), upstream (variant located 3′ of a gene) and 5′ and 3′ UTR variants captured by exome sequencing.Coding variants are inclusive of canonical splice sites, splice region and UTR splice sites.All variants that passed quality control and were observed in unrelated individuals in the RGC-ME dataset were included in this analysis (n = 34,512,842 variants).b, Enrichment of ClinVar pathogenic variants (two stars or more) in predicted SAVs compared with corresponding variant sets filtered by either LOFTEE, 5/5 missense deleteriousness models or CADD.Points represent odds ratios and bars depict 95% CIs (two-sided Fisher's exact test, no multiple testing correction).'All variants' include 313,390 coding and noncoding variants, and 'splice sites' include essential and UTR splice sites; counts of variants included in each calculation are provided in the table.HC, high-confidence.c, Empirical validation of MAPS-predicted deleterious SAVs (intersection set): enrichment of predicted deleterious SAVs in experimentally validated SDVs compared with non-SDVs.Points represent odds ratios and bars depict 95% CIs (two-sided Fisher's exact test, no multiple testing correction).A total of n = 36,636 variants, of which 346 SAVs are validated SDVs, are included.